“Calhoun 


Institutional Archive of the Naval Postgraduate School 





Calhoun: The NPS Institutional Archive 
DSpace Repository 


Theses and Dissertations 1. Thesis and Dissertation Collection, all items 


2008-06 


Integrated data-driven DSS in a laboratory environment 


Hargrave, Brian L. 


Monterey California. Naval Postgraduate School 


http://hdl.handle.net/10945/4133 


Downloaded from NPS Archive: Calhoun 


Calhoun is the Naval Postgraduate School's public access digital repository for 


\§ D U DL EY research materials and institutional publications created by the NPS community. 
«iis Calhoun is named for Professor of Mathematics Guy K. Calhoun, NPS's first 


NY KNOX appointed -- and published -- scholarly author. 


LIBRARY Dudley Knox Library / Naval Postgraduate School 
411 Dyer Road / 1 University Circle 


http://www.nps.edu/library Monterey, California USA 93943 





NAVAL 
POSTGRADUATE 
SCHOOL 


MONTEREY, CALIFORNIA 


THESIS 


INTEGRATED DATA-DRIVEN DSS IN A LABORATORY 
ENVIRONMENT 


by 


Brian L. Hargrave 


June 2008 








Thesis Advisor: Daniel R. Dolk 
Second Reader: Albert Barreto 





Approved for public release; distribution is unlimited 





TH 


IS PAGE 








NTENT 





ONALLY LEFT BLANK 





REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 


Public reporting burden for this collection of information is estimated to average 1 hour per 
response, including the time for reviewing instruction, searching existing data sources, gathering 
and maintaining the data needed, and completing and reviewing the collection of information. Send 
comments regarding this burden estimate or any other aspect of this collection of information, 
including suggestions for reducing this burden, to Washington headquarters Services, Directorate 
for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 
22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) 
Washington DC 20503. 


1. AGENCY USE ONLY (Leave blank) 2. REPORT DATE 3. REPORT TYPE AND DATES COVERED 
June 2008 Master’s Thesis 


4. TITLE AND SUBTITLE Integrated Data-Driven DSS ina 5. FUNDING NUMBERS 
Laboratory Environment 
6. AUTHOR(S) Brian L Hargrave 


7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS (ES) 8. PERFORMING ORGANIZATION 
Naval Postgraduate School REPORT NUMBER 
Monterey, CA 93943-5000 

9. SPONSORING /MONITORING AGENCY NAME(S) AND 10. SPONSORING/MONITORING 

ADDRESS (ES) AGENCY REPORT NUMBER 
N/A 














11. SUPPLEMENTARY NOTES The views expressed in this thesis are those of the author and 
do not reflect the official policy or position of the Department of Defense or the U.S. 
Government. 


12a. DISTRIBUTION / AVAILABILITY STATEMENT 12b. DISTRIBUTION CODE 
Approved for public release; distribution is unlimited A 


13. ABSTRACT (maximum 200 words) 

Decision support technologies have remained individualistic as primarily stand alone 
platforms. The ability to access and integrate a wide range of such technologies in an 
Integrated Decision Technology Environment (IDTE) can potentially increase a user’s 
ability to create more complex decision support projects. A well-designed IDTE will 
allow users to identify, learn about, access, execute and integrate disparate decision 
technologies. Data-Driven DSS provide decision-makers with the capability to store and 
sort vast amounts of data by leveraging data warehousing and data-mining. These data- 
oriented decision technologies can assist decision-makers in making better and more 
informed decisions in shorter durations of time. 

This thesis focuses on Data-Driven data mining decision technologies and how they can 
be integrated into an IDTE. In the process of identifying data mining technology 
requirements, we first created a simple taxonomy characterized by the four categories 
of association, classification, clustering, and prediction. W then designed a 
database schema for storing the requisite data about data mining technologies, and case 
studies illustrating their use. Finally we designed a simple, yet effective, interfac 

for navigating through the data-driven decision technology universe both at NPS and 
beyond. SQL commands for populating the various screens of the IDTE interface were 
provided to show proof of concept. 


14. SUBJECT TERMS Data-Driven Technology, Data Mining, IDT! 15. NUMBER OF 
Integrated Decision Technology Environment, DSS PAGES 






































77 


16. PRICE CODE 


17. SECURITY 18. SECURITY 19. SECURITY 20. LIMITATION OF 
CLASSIFICATION OF CLASSIFICATION OF THIS CLASSIFICATION OF ABSTRACT 
REPORT PAGE ABSTRACT 
Unclassified Unclassified Unclassified UU 
NSN 7540-01-280-5500 Standard Form 298 (Rev. 2-89) 
Prescribed by ANSI Std. 239-18 








TH 


IS PAGE 








NTENT 





ONALLY LEFT BLANK 





nl 


Approved for public release; distribution is unlimited 


INTEGRATED DATA-DRIVEN DSS IN A LABORATORY ENVIRONMENT 


Brian L. Hargrave 
Lieutenant, United States Navy 
B.S., Old Dominion University, 2001 











Submitted in partial fulfillment of the 
requirements for the degree of 





MASTER OF SCIENCE IN INFORMATION TECHNOLOGY MANAGEMENT 


from the 


NAVAL POSTGRADUATE SCHOOL 


June 2008 
Author: Brian Hargrave 
Approved by: Daniel Dolk 








Thesis Advisor 


Albert “Buddy” Barreto 
Second Reader 





Dan Boger 
Chairman, Department of Information Sciences 








Tea 





TH 


IS PAGE 








NTENT 





ONALLY LEFT BLANK 





iv 


ABSTRACT 





Decision support technologies have remained 





individualistic as primarily stand alone platforms. The 








ability to access and integrate a wide range of such 





























technologies in an Integrated Decision Technology 
Environment (IDTE) can potentially increase a user’s ability 
to create more complex decision support projects. A well- 











designed IDTE will allow users to identify, learn about, 
access, execute and integrate disparate decision 
technologies. 











Data-Driven DSS provide decision-makers with the 
capability to store and sort vast amounts of data by 
leveraging data warehousing and data-mining. These data- 
oriented decision technologies can assist decision-makers in 
making better and more informed decisions in shorter 


durations of time. 








This thesis focuses on Data-Driven data mining decision 

















technologies and how they can be integrated into an IDTE. In 








the process of identifying data mining technology 
requirements, we first created a simple taxonomy 


characterized by the four categories of association, 





classification, clustering, and prediction. We then 





designed a database schema for storing the requisite data 


about data mining technologies, and case studies 





illustrating their use. Finally we designed a simple, yet 





effective, interface for navigating through the data-driven 


decision technology universe both at NPS and beyond. SQL 








commands for populating the various screens of the IDTE 





interface were provided to show proof of concept. 
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I. INTRODUCTION 


A. PROBLEM CHARACTERIZATION 


A decision support system is a system under the control 
of one or more decision makers that assists in the activity 
of decision making by providing an organized set of tools 


intended to impose structure on portions of the decision- 





making situation and to improve the ultimat ffectiveness 








of the decision outcome.! 





DSS (DSS) have become increasingly important in today’s 





society as computing technology moves from strictly 





operational functions to problem-solving activities. The 





amount of data that needs to be processed and transformed in 
order to make informed decisions has increased in concert 
with contemporary computers’ ability to store and process 
ever more vast collections of data. The ability to “make 


sense” out of exponentially increasing amounts of data makes 








the use of DSS, and associated decision technologies, a more 


vital cog in the information landscape. 





Data-driven DSS provide decision-makers with the 
capability to store and sort through vast amounts of data by 


leveraging data warehousing and data-mining. These data- 





oriented decision technologies can assist decision-makers in 
making better and more informed decisions in shorter 
durations of time. This thesis will focus on data-driven 
technologies, particularly data mining, and how they can be 


integrated with other decision technologies. 





1 George M. Marakas, Decision Support Systems In The 21°* Century, 2d 
ed. (Upper Saddle River: Prentice Hall, 2003), 4. 
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B. MOTIVATION 





DSS are interactive computer-based systems or 
subsystems intended to help decision makers use 
communications technologies, data, documents, knowledge 


and/or models to identify and solve problems, complete 





decision process tasks, and make decisions. DSS assist in 
the activity of decision-making by providing an organized 


set of tools to impose structure on portions of the 





decision-making process. Currently, integrated DSS 


environments with a portfolio of available decision 





technologies are largely nonexistent. While DSS tools are 





frequently used by both staff and students at NPS, there is 





no single portal or access point to available software. An 














integrated decision technology environment (IDTE) would 


allocate these available tools in a central environment for 





use. Further, capabilities would be provided to integrate 
various technologies in support of specific applications. 


Additionally, the IDTE will host self-tutorials, 














publications, and other web-content that will benefit users. 
This research will focus on data-driven decision 


technologies that will be included into the integrated 








Decision Support Laboratory environment. 


Cc. METHODOLOGY 


The approach will be to conduct a literature survey of 





data-driven DSS. The next step will be to identify and 


categorize different data-driven technologies. The 





methodology will continue with the development of a data- 








driven decision technology taxonomy that will be used in 








designing an interface for the IDTE. This taxonomy will be 


2 


enhanced via construction of an equivalent database schema 





for storing relevant information about data mining 
technologies and associated case studies. This database 


schema will drive the design of a conceptual user interface 














for the IDTE. An overarching use case will be generated 








describing the usage of the IDTL in the data-driven decision 


technology mode. 





Current resources from the areas of Data-driven 


technologies, data-mining, data warehousing, and web-based 





DSS are integrated in order to develop a conceptual 


framework for the IDTE. 








D. SCOPE AND LIMITATIONS 





This thesis will identify an approach to the 








preliminary design and development of an IDTE with special 


emphasis on data mining technologies as a subset of data- 





driven DSS. A working model will be designed but not 
implemented. This thesis will provide a conceptual view of 
the database models to be incorporated into future research 


and development of the IDTE. 








E. OUTLINE OF THESIS 











Chapter of this thesis describes data mining, data 





warehousing, and knowledge discovery technologies, including 

















an ontology in the form of a simple taxonomy. Chapter 





presents a conceptual data mining technology schema to be 














utilized in the development of the IDTE. Chapter IV presents 








a conceptual user interface design for the IDTE based upon 





the database schema and following a use case scenario. 


Chapter V provides a summary and recommendations for future 





enhancements of the IDTE. 


II. DATA WAREHOUSES AND DATA MINING 


A. GENERAL 





In this chapter, we examine two key concepts critical 


to the research. We will conduct a brief overview of data 








warehouses and data mining as they pertain to DSS. From this 
review, we will develop a taxonomy of data mining followed 


by a survey of available software at Naval Postgraduate 





School (NPS). The purpose of developing this taxonomy is to 





provide a framework for eventual users of the IDTE to access 








and use data mining software systems effectively. 





B. DATA WAREHOUSES 


A data warehouse iS a specific type of database 





designed to support managerial decision-making. A database 





is simply “a collection of interrelated data organized to 





meet the needs and structure of an organization and can be 
used by more than one person for more than one application.? 


A database by itself may not provide the user with useful 





decision-making knowledge, particularly if it is operational 





in nature. A standalone, operational database primarily 


provides a place for data to be stored in a structured way 





until needed for report generation and/or queries. It allows 











2 Efraim Turban and Jay E. Aronson, Decision Support Systems And 
Intelligent Systems, 5" ed. (Upper Saddle River: Prentice Hall, 1998), 
80. 











the user to access, view, and change data as required, 











mostly with the objective of maintaining timely and accurate 


data. 


A data warehouse, on the other hand, differs in its 





objectives. Data Warehouse definitions vary widely, but a 





serviceable version is “data warehousing combines data from 
multiple databases or data sources into a large database for 


the purpose of providing mor xtensiv information 





retrieval and data analysis.”2 A data warehouse is the 





backbone of a DSS. As shown in Figure 1, all information is 





taken from or put into the data warehouse for the primary 


purpose of knowledge discovery. 





3 Shon Harris, All In One CISSP Exam Guide, 377 ed. (New York: 
McGraw-Hill, 2005), 830. 


















Data 
warehouse 










Data 
sources 





DATA 
MINING 









Figure 1. Data Warehouse Architecture.? 


The Architecture also illustrates how the data is taken 
from multiple sources, formatted, cleaned (checked for 


integrity, validity, duplicity, and relevance), and 








refreshed as it is incorporated into the warehouse. Data 
Marts, as shown in Figure 1, are subsets of the data 
warehouse. The data mart can be a small snapshot of the 
larger data warehouse or it can be a repository of 
specialized information about a particular area of data. An 
example of a data mart is one of an enterprise that has a 
data warehouse but creates data marts that separate the 


different departments or regions of the company. This 


4 Catherine M. Ricardo, Databases Illuminated (Sudbury: Jones and 
Bartlett, 2004), 735. 
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provides the benefits of working on a smaller scale while 
limiting the damage a mistake will permeate through the 


warehouse. 


A data warehouse serves several functions beyond that 





of simple storage of data. The data warehouse is most useful 
when utilizing the data retrieval and analytical functions. 


As shown in Figure 1 previously discussed, these functions 








include DSS (DSS) using querying and reporting tools, On- 





Line Analytical Processing (OLAP), and data mining. The main 


differentiators of a data warehouse is that, via OLAP, it 








provides the ability to look at data multi-dimensionally, 








and via data mining to discover patterns across historical 


data. 


Querying and reporting are the simplest tasks for the 





data warehouse. The query is what it sounds like. It is a 























question that the user generates to the data warehouse. Th 








questions that are asked will be specific in nature. An 


example of a simple query that a user might ask a database 





is to list the Navy Personnel who are male. This will work 


as long as the database has a table of Navy Personnel and 





one of the attributes listed in that table is the gender of 


th servic member. This will return a list of all Navy 











personnel that are male. The queries can be more complex. 


These more complex queries look across several different 











tables in the database. An example of a complex query might 
be the same query as before but with additional 
restrictions, such as a list of Navy personnel who are male 


and participating in the Thrift Savings Plan (TSP). The 





query would ask the database to search the table Navy 





Personnel for male personnel and search the table TSP for 


the male Navy personnel. There would have to be a link that 
connects the information from one table to the other. 


Generally, this is some identifiable unique element that 





will be included in the tables. For the above example, a 
good element would be the Navy Personnel’s social security 


number (SSN). Figure 2 shows a simplified view of how the 





tables in the database example would be connected. 


Navy Personnel 


Thrift Savings Plan 





FK1 | SSN 
Allotment Amount 
Direct Deposit 


Figure 2. Sample Entity Relationship Diagram. 








This pattern of setting up the databases and the tables in 
the databases continues for even more complex queries that 
search across multitudinous tables with numerous attributes 


each. 





Online Analytical Processing (OLAP) provides the user 











with the capability to conduct multidimensional analysis of 





data to include data modeling, trend analysis, and other 





complex calculations. OLAP allows the user to conduct ad hoc 





queries in multiple dimensions providing flexibility and 


speed to the search in databases that contain a very large 





and complex amount of data. OLAP software is best utilized 
when incorporated into the database as it is created. The 
data will be formatted and structured to support rapid, 


multi-dimensional queries. The star schema is a popular 





framework implemented in a relational database. Figure 3 
shows how a simple star schema would store data in the 


database. 










District Descriptor 
RegionID 
Region Descriptor 


ContactFirstName 
ContactLastName 
CustomerType 
BillingAddress 
City 
StateOrProvince 
PostalCode 
Emailaddress 





















ProductDescription 
CategoryDescription 
SupplierName 
UnitPrice 

JUPC Code 















Figure 3. Star Schema Diagram.°® 


As shown in Figure 3, a star schema will have a central 
table called a fact table that contains attributes common to 


the other tables. The dimension tables ar then linked to 





the central table by foreign keys.® The attributes on the 
fact table quantifies the data represented by the dimension 


tables. This allows the combination of data from multiple 





tables that can be aggregated while maintaining their 


meaning. The dimension tables in contrast describe the data 





organized in the fact table. It presents unigue keys 


connecting the tables together. This provides the ability to 


5 Daniel J. Power, Decision Support Systems: Concepts And Resources 
For Managers (Westport: Quorum Books, 2002), 127. 


6 Ibid. 
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quickly perform analysis on very large datasets.’ OLAP 


analysis becomes increasingly beneficial when combined with 





data mining for analysis. 


Cc. DATA MINING 











Data Mining (also called knowledge data discovery, KDD) 
is best defined as the use of automated data analysis 


techniques to find previously unknown connections between 





data stored in a data warehouse. The knowledge gained from 
utilizing these techniques and finding the hidden 
relationships and patterns associated with the data can help 


the user gain a competitive advantage over those who do 





not. Data mining tools are also used to automate the 





process of predicting outcomes from information in large 


databases. 


The data mining techniques can vary from company to 
company as they adjust the techniques to fit their 
particular needs. The basic techniques that are focused on 


in this research are the common categories that are 





currently in use: 1) classification, 2) association, 3) 





prediction, and 4) clustering.? 


1. Classification 


Classification is a data mining approach that develops 





rules that determine if the item belongs to a particular set 


of data. These rules are generally simplistic in nature. 











7 George M. Marakas, Decision Support Systems In The 21°* Century, 2d 
ed. (Upper Saddle River: Prentice Hall, 2003), 332-333. 








8 Daniel J. Power, Decision Support Systems: Concepts And Resources 
For Managers (Westport: Quorum Books, 2002), 143. 
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They often follow a simple logic sequence such as IF the 


data meet X parameters, THEN it will be placed with other 








data that meet the same parameters. These rules can be 
modified to provide a specific percentage of accuracy. These 
modified conditions traditionally are a) exact rule, b) 


strong rule, and c) probabilistic rule.?!° 





a. Exact Rule 


The exact rule gives one hundred percent 


probability that the data will meet the parameters of the 








class that it is placed into. It explicitly fulfills the 





preconditions of the IF-THEN statement. There are no 


exceptions to the rule that are allowed.11 
b. Strong Rule 


The strong rule takes a range of exceptions into 


account when admitting data into a certain classification. 





This rule allows for a range between ninety and one hundred 











percent probability. The data element in this case does not 


have to exactly match.14 


c. Probabilistic Rule 





The probabilistic rule creates a subclass that 





classifies the data based on a measured probability. It does 


9 George M. Marakas, Decision Support Systems In The 21°* Century, 2d 
ed. (Upper Saddle River: Prentice Hall, 2003), 333. 








10 George M. Marakas, Decision Support Systems In The 21%* Century, 
2d ed. (Upper Saddle River: Prentice Hall, 2003), 334. 








11 Thid. 
12 Thid. 
12 





this by relating the conditional probability P (THEN|IF) to 





the probability P (THEN) .13 


The data mining technique utilizing classification 





requires the analyst to set up the classes beforehand in 
which to place the data. The specific questions that the 


company need answered are determined early and the classes 





are set up accordingly to answer them. 








An example utilizing classification would be a video rental 
stores customer rental history. The classes would be set up 


based on several attributes of a typical customer base, 








which would likely include categories such as age, gender, 


movie type, and membership type. This would allow the video 





store to determine the movies most likely to be rented and 








by whom. Based on the data in the classes, they will then be 





able to predict given a pre-determined probability how many 
copies of a new movie to purchase and who they will expect 


to rent based on the type of movie. 


2% Association 


Association rules relates one set of data to others to 





find patterns of events that happen concurrently. This 


relationship is denoted by {event A} > {event B}. This 








shows that when event A takes place event B is also likely 





to take place. While the above mentioned relationship 





indicates each event as a Singular entity, event A and event 





B could each represent a group of events that are associated 


with each other. Similar to the rules that further define 


13 George M. Marakas, Decision Support Systems In The 21°* Century, 
2d ed. (Upper Saddle River: Prentice Hall, 2003), 334. 
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classification, association has two measures represented by 








percentages that lend a numerical value to the interaction 


between events: support and confidence.!4 
a. Support 


Support as it pertains to data mining association 





rules is the percentage of a data set that contain all the 








elements that make up event A and all the elements that make 





up event B.15 In other words, that percentage of the data 


records for which Event (A) -> Event (B). 


b. Confidence 





Confidence is closely related to support except it 








takes event A as the data set and finds how many times event 


B present. 1é 


For example, an association rule may be generated 





that when a video renter rents a children’s movie, he is 





likely to buy candy. This would be expressed as {children’s 
movie} ~> {candy}. Knowing this association would benefit 


the video store because it would allow them, say, to place 





the candy near the children’s movie section in the video 


store. 


Using this example, the support measure would be the 


ratio of those who both Rent (Movie) & Purchase(Candy) to all 


14 Catherine M. Ricardo, Databases Illuminated (Sudbury: Jones and 
Bartlett, 2004), 748. 

















15 catherine M. Ricardo, Databases Illuminated (Sudbury: Jones and 
Bartlett, 2004), 748. 








16 Ibid. 
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those who Rent (Movie). If, out of 1,000 movie rentals, 100 





renters also bought candy, then the support would be 10%. 


Now assume that out of the 1,000 movie rentals, 250 of 


them were children’s movie rentals and that from the 





250children’s movie rentals, 50 of them sold candy at the 
same transaction. The Confidence level would then be 20 


percent to the association rule. 


The association technique to data mining can be useful 





in finding less obvious connections between seemingly 





diverse objects. An example of such a connection that proves 
the value of the technique is the association that men who 
buy diapers frequently purchase beer as well.!’ Such a 


connection is less intuitive but can reap many benefits for 





company advertising and product placement. 


3 Prediction 


Prediction follows the association analysis pattern to 


determin relationships between independent events. 














Prediction differs in the fact that it utilizes the 





completion of an event and past history to predict the next 





event that is most likely to occur. It forms a predictive 


chain of events that can be expected to take place based on 








past data that has been collected and analyzed. Prediction 





follows a timeline of events that only advances when the 
previous event completes. As data collects on specific event 


chains, a percentage can be calculated to indicate the 





probabilities that succeeding events will follow the 





17 George M. Marakas, Decision Support Systems In The 21°* Century, 
2d ed. (Upper Saddle River: Prentice Hall, 2003), 334. 
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sequence.!8 The popular models used for prediction are 





decision trees and neural networks. 


a. Decision Trees 





Decision trees ar sequences of events that are 





derived from the data by using logic statements and rules to 
partition the data sets into increasingly smaller 


categories. This drilling down of the data sets continues 





until a user defined stopping point is reached, or the 


partitioning occurs at the simplest level. Much like 





classification, the decision nodes are IF-THEN statements 





that split the data sets into multiple branches. The 





completed tree will form a complete path of events based on 


the data.!9 This will allow a predictive relationship to 











form, connecting the input to a probabilistic end state. 


b. Neural Networks 


Neural networks are data models that are created 


to adapt to new data as it is integrated into the model. 





Neural networks find patterns from past data to predict 


outcomes.29 A typical neural network follows a three step 





process. First it predicts an outcome, and then compares the 


prediction to the actual outcome. The final step is an 








adjustment of the model, depending on whether there was a 


match or not. This is the simulated learning process.?1 





18 Catherine M. Ricardo, Databases Illuminated (Sudbury: Jones and 
Bartlett, 2004), 749. 





19 George M. Marakas, Decision Support Systems In The 21%* Century, 
2d ed. (Upper Saddle River: Prentice Hall, 2003), 336. 








20 Daniel J. Power, Decision Support Systems: Concepts And Resources 
For Managers (Westport: Quorum Books, 2002), 151. 


21 Tbid., 152. 
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Following the three step process, the neural network 
represents a human reaction and decision making process. A 
human will adapt the decisions that they make as _ new 


information and data becomes available. The neural network 





is developed to mimic that human response to new data. 





A prediction example would follow the purchasing habits 





of a customer captured at a hardware store. It can be 
predicted that when a customer buys drywall, he will most 
likely purchase nails the next visit. After numerous 


customers and visits, the data from their purchases will be 





collected and the sequence of events can be predicted with a 


certain probability of actually occurring. 


4. Clustering 








Clustering techniques are related to the classification 
technique in that they group data based on similar 


characteristics. These groupings are called clusters. Where 








classification determines the groups’ before analysis, 





clustering determines the grouping during analysis based on 





attributes and metrics determined by the user. This 
technique is especially useful when the groupings cannot be 


predefined. 24 


Clustering techniques typically implement certain 














algorithms to defin their cluster parameters. These 











algorithms are; a) Hierarchical and b) Partitioning. 





22 George M. Marakas, Decision Support Systems In The 21°* Century, 
2d ed. (Upper Saddle River: Prentice Hall, 2003), 335. 
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a. Hierarchical 


Hierarchical clustering techniques allow the user 


to analyze nested data at different levels into which data 





fields are sub-divided. It does this by creating a diagram 


Similar to a decision tree that is made up of clusters of 





data. Hierarchical clustering can either be agglomerative or 








divisive. 
Agglomerative hierarchical clustering starts with 


each point of data as an individual cluster. The pair of 








clusters closest together then combines into one cluster. 


This pattern continues for an undetermined amount of 





iterations until a pre-determined number of clusters are 


left or until only one cluster remains. 





Divisive hierarchical clustering takes an opposite 











approach to agglomerative clustering. It starts with a 


Single cluster of all the data points. The cluster then 





splits into two separate clusters. The cluster will continue 











to split until a pre-determined number of clusters are 





reached.23 Typically, only a single cluster will split at 


successive iterations. 
b. Partitioning 


Clustering using the partitioning technique places 


data into cluster sets. These cluster sets do not overlap 





and each data point will only be a member of one cluster.24 


23 Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction To 
Data Mining (Massachusetts: Addison-Wesley, 2006), 515. 





24 Tbid., 492. 
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There are two popular methods to determine how the clusters 


are organized, K-medoids method and K-means method. 


The k-medoids method takes a selected number of 


data points and uses them to represent the centers of 








clusters. Data points are then added to the clusters based 


on the proximity to the initial data points.?° 


The k-means method is the more popular 








partitioning cluster method. The clusters are grouped around 





the centroids. The centroid will be the geometric center of 





the cluster of data points. The initial centroids are chosen 


by the user. The surrounding points are then added to the 





clusters. The centroid location is then updated based on the 





geometric mean of the points. The data points are then 





adjusted to the clusters to which they are more closely 
located. The centroid location then is calculated based on 
the new cluster of data points. The iterations continue 


until the centroid locations no longer change. ?® 


Clustering is often utilized in analyzing medical data. 
For example, a clustering of medical data relating symptoms 


to diagnosed diseases can be useful to find hidden or 


seemingly unrelated symptoms that were previously 
undetected. 
D. CONCEPT TAXONOMY IN DATA MINING 





A data mining taxonomy can be developed based on the 


descriptions and definitions defined above. The taxonomy 


25 Pavel Berkhin, Survey Of Clustering Data Mining Techniques (San 
Jose: Accrue Software, 2002), 15. 





26 Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction To 
Data Mining (Massachusetts: Addison-Wesley, 2006), 497. 
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developed places the different data mining techniques into 


an ordered representation of the relationships. 


Data Mining 
Techniques 





— 


Figure 4. Taxonomy of Data Mining. 








The taxonomy shown in Figure 4 will be utilized and 


followed in the next chapter. This taxonomy gives a base for 








further research to be implemented into an IDTE combining 








the data-driven technologies described with other DSS 


technologies, such as model-driven and knowledge-driven. 


E. SURVEY OF AVAILABLE SOFTWARE 


A survey of available data mining software at the Naval 
Postgraduate School found two main software programs in use: 


1) SPSS Clementine and 2) Megaputer PolyAnalyst. 


1. SPSS Clementine 


SPSS Clementine provides a graphical user interface 
that shows the steps in the data mining process as they are 
being used. Clementine offers the user the ability to 


utilize the techniques of data mining mentioned in this 





chapter. It is designed to allow the user to choose between 
20 


Clustering, Classification, Association, and Prediction.2? 








Figure 5 shows the Clementine interface and start-up screen. 


® Stream1* - Clementine 11.1 
File Edit Insert View Tools SuperNode Window Help 


DV BFROSoreoHirG aA 


| Classes 

(unsaved project) 
_ (& Business Understanding 
_ & Data Understanding 
~~ (& Data Preparation 
~ & Modeling 
_-(& Evaluation 
~-(& Deployment 








@ Favorites | @ Sources | @RecordOps | @FieldOps | A Graphs @ Output |» @ Export 


® ®@& OO ® @©@ © ©@68@89 © @ © 


Binary Classifier NeuralNet C50 C&R Tree QUEST CHAID Decision List Kohonen K-Means TwoStep Time Series Anomaly Apriorl GRI Carma Sequence PCA/Factor Feature Selection Regression Logistic C 





4 ee 
O Server: Local Server 


Figure 5. Clementine Start-up Screen. 


J BOMB /134MB | 





2. Megaputer PolyAnalyst 


Megaputer PolyAnalyst is a robust software package that 





utilizes several data mining techniques for knowledge 


discovery. The techniques that PolyAnalyst can perform are: 





Classification, Clustering, Association, and Prediction 


which includes Pattern Learning and Trend Analysis.28 


27 spss Clementine 11.1 Specification Brochure. 2005. Online. 
Internet. 12 Feb. 2008. Available from 
http://www.spss.com/pdfs/CLM11SPClr.pdf. 





28 Metaputer PolyAnalyst Brochure. 2007. Online. Internet. 12 Feb 
2008. Available from http://www.megaputer.com/polyanalyst.php. 
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Figure 6 shows a snapshot of the features available in 


the software package. 


Machine Learning 


4 


aoe ey °° 


r ; 
: Cleansing 
ee a _ 


it 
— ne, 





Figure 6. PolyAnalyst Features Diagram. ?? 





Megaputer PolyAnalyst combines these techniques into an 
easy to read graphical interface and provides custom reports 
on the results. Figure 7 shows the user interface and 
startup screen for PolyAnalyst. From this startup screen the 
user can navigate to an existing project, create a new 
project, or go to a tutorial section that has fifteen 


different tutorials. These tutorials will walk the user 





through a variety of techniques from a simple introduction 
to data mining and PolyAnalyst operation to more in-depth 


tutorials. 





29 Metaputer PolyAnalyst Brochure. 2007. Online. Internet. 12 Feb 
2008. Available from http://www.megaputer.com/polyanalyst.php. 
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Figure 7. PolyAnalyst Start-up Screen. 


FEF. SUMMARY 





In this chapter we reviewed the essential data 








warehousing and data mining structures and technologies. A 
formalized taxonomy of data mining technologies’ was 


presented and we concluded with a survey of the available 








software at NPS, SPSS Clementine and Megaputer PolyAnalyst. 























In the next chapter (Chapter - Data Mining Schema) 
we will provide a concept Decision Technology and Data 


Mining Schema to be utilized in the development of the IDTE. 
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III. DATA MINING SCHEMA 


A. INTRODUCTION 








In this chapter, we provide a concept Decision 





Technology and Data Mining Schema to be utilized in the 








development of the IDTE. The tables needed for the operation 








of the IDTE will be described and the relationships between 


the tables will be identified. These schemas are not all 





inclusive and leave the ability to add additional tables and 








relationships as necessary to increase the range of decision 
technologies as they become available to the user. This 


chapter and thesis focus on data-driven decision 











technologies and as such will not develop in detail the 





other decision technologies in the schemas or in the 


relationships beyond the first level. These tables and their 








relationships will be the basis for the creation of the 





concept user interface described in the next chapter. 


B. DATA MINING SCHEMA 


The data mining schema was created utilizing the 


Tabledesigner program. The following is a brief description 





on how to read the Tabledesigner diagram and interpret the 


symbols contained within. 


Objects: An object can be regarded as the equivalent of 


an entity, and typically will be represented by at least one 
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corresponding database table in a database. Figure 8 shows 


the Decision Technology object. 


E) Decision Technology 
G £DT_Id 1-1 
og DT_Platforr 4_4 


GO DT_Type 4-4 
og DT_Type_l 4.4 p> 








Figure 8. Decision Technology Object. 


Each object in Tabledesigner may contain one or more of 
the following attributes: simple attribute, group attribute 


and object attribute. All attributes are assigned 





cardinalities which can be zero-to-one (0.1), one-to-one 


(1.1), zero-to-many (0.N), or one-to-many (1.N). 


Simple attributes: A simple attribute is equivalent to 


a data item representing a single piece of information. In 





terms of database design, data items represent columns in a 


database table. Referring to Figure 8, O DT_Tyre 1-1} would be 


one of the data items for the object Decision_Technology. 











Group attributes: A group attribute is a container 
within an object that collects items that are conceptually 


related. It does not appear anywhere in a data table, rather 





it visually makes it easier for the user to see and relate 
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appropriate data items. Figure 9 shows the group attribute 
DT_Platform which groups several simple attributes 


pertaining to the platform. 


os DT_Platform 4.4 
& Software_System 4_4 
6 Version 4_4 
GO No_Licenses 4_4 
6 SerialNo 9_4 


OG Date_Purchased 4_1 
G Amount 4_4 


& Company Name 4_4 
GO Web_Site 9_; 








Figure 9. Data Group Example. 


Object attributes: Object attributes form the mechanism 


for specifying relationships between two objects. These 





relationships are represented by including an object 


attribute in another object which requires the reverse 








process as well. Figure 10 shows this relationship, as Data 





Mining_DT has an attribute in the DM_Technigue Object and 





vice versa. This represents a relationship between the two. 
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The details of that relationship are indicated by the 


subscript numbers or cardinality that follows the attribute. 


EE] Data Mining_DT EE) DM_Technique 


og $DMing_DT_Id 4.4 > & #DM_Tech_Id 44 


E) #{DataDriven_DT | 1_, & Description 9_4 


og Supports 4. =  Tutorial_Link 9_4 


E] |DM_Technique | 4_, O Reference 9_) 


os BelongsToCategory 4) = 


B [DM Caen] 


og Is4FeatureOf 94 — 


E] [Data Mining_DT | 4_, 





Figure 10. Object Link Example. 





Cardinalities allow the creator to determine the number 











of allowable instances in each direction of a two-way 


relationship that is required. 





One-to-one relationship: This would be indicated in the 
schema by the Maximum Allowed (the second subscript number) 
property set to one on both sides of the relationship. This 
means that one instance can be associated with only one 


other instance of a related item. 


One-to-many relationship: This is a relationship where 
one instance can be associated with many other instances, 
and many instances can be associated to a single instance. 
This is shown by one instance having its Maximum Allowed 
property set to one and linked to other instances that have 
their Maximum Allowed property set to N for indicating 


numerous. 


Many-to-many relationship: This is a relationship where 


many different instances can be associated with many other 
28 


different instances. This is shown by having the Maximum 


Allowed property set to N on both sides of a two-way 





relationship. 


Figure 11 is an initial data mining schema emphasizing 


data-driven decision technology process. 



















E) DataDriven_DT 
of €Data DT_Id44 > 
B [DecionTechroogy] 

B [baa Mira OT Jo, 
Bi [Dat Warehoute OT] 
















E) ModelDriven_DT 


Decision Technology 


© DocumentDriven_DT 


E) Decision Technology 
G £DT_Id 4.4 


E) CollabDriven_DT 


Decision Technology 


og DT_Platform 4.4 ~ 


 Software_System 4_4 





O Version 4.4 






 No_Licenses 4_4 





& SerialNo 9.4 E) |OLAP_DT | 9. |] #[Decision Technology | 4_4 
6 Date_Purchased 4.4 E) [Data Visualization_DT | 9.4 
@ Amount 4-4 


OG Company Name 4_4 




























@ Web_Sit 

3 eg - is [=] DM_Technique 
08 POC 9.4 —— 

API 4 28 #DMing DTLIdq4 7 6 #DM_Tech_Id 44 


E) #DataDriven_DT } 4_, 


og Supports 4—y = 


6 Description 9.4 

& Tutorial_Link 9.4 

GO Reference py 

o€ BelongsT oCategory 4.— = 


BB [DM Categ00 | 1. 


og Is4FeatureOf op) 


| [Data Mining_DT | 44 


 ImportFormat 9_1y 
G ExportFormat o.ty 
G DT_Type 4.4 
og DT_Type_l 4.4 > 
B o 
a re 
B 5 
B on 
&) [CollabDriven_DT | p.1 











© Data Warehouse_DT 






©) OLAP_DT 
DataDriven_DT 


E] DM_Category 
G& £Category_ld 4-4 


og ContainsTechnique 4. = 


E] [DM_Technique | 4_4 





E) Data Visualization_DT 





E) #DataDriven_DT | 4_4 


‘ »| 


Figure 11. Data Mining Schema. 
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This schema can be represented as a 


Explorer format which would look similar 


a] IDTE 


taxonomy in Windows 


to Figure 12. 








- Decision Technology 
&}- Data-Driven 
: fi Data Warehousse 
4) OLAP 
=| Data Mining 
&]- Association 
&1- Classification 
&)- Classification Techniques 
-~ Decision Tree 
bon Nearest Neighbor 
-~ Neural Networks 
'-- Rule Induction 
&}- Clustering 
i oe Clustering Techniques 
-~ Agglomerative Hierarchical 
i CLIQUE 
-~ K-Means 
i K-Medoids 
)- Prediction 
=|. Prediction Techniques 
e~ CART 
i» CHAID 
-~ Decision Tree 
-~ Multiple Regression 
i. Neural Network 
- Document-Driven 
- Model-Driven 
- Knowledge-Driven 
-- Collaboration-Driven 
Data Visualization 


























8 
# 
8 
8 
eH 














Figure 12. Windows Explorer Taxonomy. 


Table 1 outlines the attributes and provides a brief 





description, data type and example of the items included in 


the data mining schema. 
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Attribute Description Attribute | Example 
Type 
DECISION A single software Object Megaputer, 
TECHNOLOGY system in the IDTE Clementine 
DT_Platform Salient Group Megaputer 
information about 
the software, some 
but not all 
attributes of 
which are hinted 
at here 
Attributes listed 
DT_Type Decision Group Data-driven, 
technology type Model-driven, 
etc 
DataDriven_l One instance for 
each data-driven 
software DT; 
inherits the key 
from the 
corresponding 
Decision 
Technology 
instance 
DD_Category Group attribute Group Data Warehouse, 
shows the OLAP, Data 
generalization Mining, and/or 
hierarchy another Data 
level deep with Visualization 





DTs 





data-driven 











DataMining_l 


One instance for 
each Data Mining 
data-driven 
software (e.g., 
Megaputer) ; 
inherits the key 
from the 
corresponding 
DataDriven_DT 
instance 
































DM_Technique 


One instance for 











Decision_Tree 
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each data mining 
technique 








DM_ Category 





One instance for 
each of four data 
mining categories 


Classification, 
Clustering, 
Association and 
Prediction 











Description 
> Reference 


These attributes 
refer to the 
documentation 
available for this 
software decision 
technology; these 
are not intended 
to be inclusive; 






































additional 
attributes may be 
desirable 
Table 1. Attribute Table. 
Table 2 outlines relationships and provides a brief 





description, and example for the data mining schema. 





Relationship 


Description 


Example 





Supports 1- 
N.DM_Technique 1-1 


shows one-to-many 
relationship that 
an instance of 
DataDriven_DT may 
have many 
corresponding 
instances of 
DM_Technique 




















IsAFeature O0-N 
.DataMining_DT 1-1 


shows zero-to-many 
relationship 
between 
DM_Technique and 
DataMining_DT 














the DecisionTree 
technique may 
appear as a feature 
in both Megaputer 
and Clementine 











BelongsToCategory 








one-to-many 





the DecisionTree 
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DM_Category 


relationship 
showing that an 
instance of a 
DM_Technique may 
belong to at least 
one and perhaps 
many instances of 
DM_ Category 








technique may 
appear as a feature 
in both Megaputer 
and Clementine 





ContainsTechnique 
1-N.DM_Technigque 1- 
1 


shows one-to-many 
relationship 
between DM Category 
and DM_Technique 











the category 
Prediction may 
include Regression 
and NeuralNetworks 
techniques as well 
as many others 

















Table 2. 


Relationship Table. 
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Cc. DATA MINING SCHEMA EXTENDED TO INCLUDE CASE STUDY 












































[—) DM_Technique 
G #DM_Tech_Id 4.4 
6 Description 9.4 

6 Tutorial_Link 9.4 
O Reference oy 


[) Decision Technology 
G £€DT_Id 4.4 
og DT_Platform 4.4 


) Case Study 


& £Case Study_ID 4_4 
6 Description 9.4 
GO Software_System 4_4 


GO Reference 9.1 


08 Model 4.) = 6 Version 4-4 


E) #Decision Technology | 4_4 


GO No_Licenses 4.4 og BelongsToCategory 4 = 


eg Process 4. 7 & Serial No 9.4 el 1-1 
|) #[DM_Technique a | @ Date_Purchased 4_4 og Is4FeatureOf 9.4 
fel 1-N Amount 4.4 E] [Data Mining_DT | 4_4 


6 Company Name 4_4 
GO Web_Site 9.4 

GC API 9.4 

G ImportFormat gry 


Bleaesub lon 





) Data_Set 


GO £DataSet_ld 4_4 
& File_Location 4-4 
GO Date_Created 9.4 
6 Date_Last_Updated 9.4 
Format 4.4 
of POC 9.4 > 
G *Name 4.4 
og Address 9.4 D> 


GO EMail 4_4 


og Phone g.4 > 


G ExportFormat g_py 

G DT_Type 4-4 

of DT_Type_l 4.47 
B a 
B 
is o4 
B 0 
iB 0. 








El [Case Study | 1.1 Bl [Case Study Jon 
Figure 13. Schema including Case Study. 


The schema in Figure 13 has been enhanced to include 





applications in the form of the Case Study object. New or 


modified objects are described below in Table 3. 





Attribute Description Attribute | Relation | Description 
Type ship 





Case Study This object is 
intended to 
capture 
relevant 
information 
about any data 
mining 
applications 
which have 
been developed 
and saved in 
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the IDTE 








Model 1-N 


Intended to 
capture the 
fact that an 
application 
may have 
included one 
or more 
Decision 
Technologies 
each of which 
may use one or 
more 
DM_Techniques 
and Data_Sets 








Group 









































DM_Technique: Case Study |A 
O-N DM_Technique 
may have 
appeared in 
QO, 1, or more 
instances of 
Case Study 
Data_Set This object is Case Study | A Data_Set 
meant to may be used 
O-N : 
capture in O03 Lp “or 
instances of more 
reusable instances of 
datasets in Case Study 
the IDTE. 
Note that this 
object will 
need 
considerably 
more 
attributes to 
fully describe 
each data set 
Table 3. Case Study Attribute/Relationship Table. 
D. DATABASE TABLES 


The Tabledesigner program allows the user to create a 


database from the model/schema created in the program. 


on th 





in Microsoft Access. 


schema depicted in Figure 8, 


From that database, 
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a database was created 





relationship 





Based 


diagrams and database tables were produced allowing the 





ability to created user queries. 


DM_Technique Reference 


Decision Technology ModelDrven DT D_ DM_Technique 
'D ——_—_—<$_—_—_|—§_—_ | | |DM_Technique_ID_FK 


DT_Id D_ Reference 
DT_Platform_Software_System Decision Technology_ID_FK DM_Category 
DT _Platform_Version SSS 
DT_Platform_POC_Name A 

DT Platform_POC Address Steet KnowledgeDriven DT 
DT Platform_POC_Address_city D 
DT_Platform_POC_Address State Decision Technology_ID_FK 
DT _Platform_POC_Address_Zip ETS 
DT_Platform_POC_EMail [Pee 
DT Platform_POC_Phone_Area Code CollabDriven_ DT 
DT_Platform_POC_Phone_Phone Number D 
DT Platform No Licenses Decision Technology_ID_FK 
IDT_Platform_Date_Purchased 
DT_Platform_Amount 


Dr Patform_ Company Name Decision Technology_ExportFormat DM_Technique BelongsToCategory 
DT_Platform_Web_Site 
: E 


Category _Id 




















DT_Platform_Serial_No D 5 
DM_Category_ID_FK 00 


DT Platform_API Dezson Technology_ID_FK 
Type DM Tecmique_1D_FK 


DT Type IDT_Platform_ExportFormat 




















Decision Technology ImportFormat 


09} |D_ 
Decision Technology _ID_FK Data Mining OT po |D_ 
DT_Platform_ImportFormat = 00| |DM_Technique_ID_FK 


Data Mining_DT_Supports 





- f__ 
DocumentDriven DT DataDriven_DT_DD_Category 

o| |D_ 
Decision Technology_ID_FK 

















DataDriven DT — 
o| |D. DataDriven_DT_ID_FK Data Visualization DT 


Dsta_DT_I¢ Decision Technology 1D_FK D_ 

















Figure 14. Microsoft Access Entity Relationship Diagram. 


From the diagram presented in Figure 14, we can develop 


the database tables. These database tables, examples shown 
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in Figures 15, 16, and 17, are the basis for the SQL queries 
described in section E. The tables are shown with data 


records. 





2 Clustering 





3 Prediction 
4 Classification 








Record: (14 4) 1 (DDN) of 4 


Datasheet View 





Figure 15. DM_ Category Table. 


3 || & x 


Type a question for help > 


‘J Z . a 


: L Z Ys ¥y F 73 ~| © 


DM_Tech td Tutoral_Link 


2 Multiple Regression 
3 Neural Network 
4 Decision Tree 
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Datasheet View 





Figure 16. DM_Technique Table. 


37 


G = 
, » ee 


a We DD | ee) oe , We rae 2 DM | AS oa) w | 
DM_Category_ID_FK DM_Technique_ID_FK 
2 / 























1 
4 
2 
3 
3 
4 





Record: [14] «)[ 1 [P_](E pak] of 6 


Datasheet View 





Figure 17. DM_Technique_BelongsToCategory Table. 


Fiqure 15 shows the records of the different data 


mining categories. Figure 16 shows a selection of data 











mining techniques. Figure 17 shows the many-to-many 





relationship between the categories and the techniques. 








E. QUERIES 


The SQL query language allows the user to retrieve data 
from the database tables by specifying what data they are 
looking for. The SQL query language specifies the SELECT 





statement as the means of creating the queries the user 


wants answered. The general format for an SQL query follows: 
SELECT column_name(s) 
FROM table name 


The SELECT determines what columns that the user wants to 
search for the data. The FROM determines what database table 


the user wants to look for the data in. The results are 
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displayed in a result table. An example of a general SQL 


query from the database table provided in Figure 16 would 





be: 





SELECT DM_Technique.Technique_Name 


FROM DM_Technigque; 





This SQL query would return a result table like the one 





in Table 4. The columns would be filled with the data 





contained in the database, for example a listing of the data 


mining techniques. 


|| Technique_Name 
(3 


a Multiple Regression 


B Neural Network 
ie Decision Tree 








Table 4. Data Mining Technique Result Table. 


Additionally, the queries can be expanded to join two 
or more tables into a single query to combine and organize 
the data to the user’s specifications. An example of a more 
advances SQL query that joins the data mining categories to 
the data mining techniques to produce a result table showing 
what techniques are part of which categories would look like 


the following: 





SELECT DM_Category.Category_Name, 
DM_Technique.Technique_Name 














D 

FROM DM_Category INNER JOIN (DM_Technigque INNER JOIN 
DM_Technique_BelongsToCategory ON DM_Technique.ID_ = 
D n 
D 
D 





























I ique_BelongsToCategory.DM_Technique_ID_FK) ON 
DM Category.ID_ = 












































DM_Technique_BelongsToCategory.DM_Category_ID_FK; 
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This SQL query would return a result table like the one 


in Table 5. 


K-Means 
ie Prediction Decision Tree 
| _|Prediction Multiple Regression 


Category Name Technique Name 
| | Category Name | Technique Name 
> | SREeni 


i Prediction Neural Network 
i Classification Neural Network 
wa Classification Decision Tree 








Table 5. Data Mining Category/Technique Result Table. 





There are numerous SQL Queries that can be created that 
can answer most questions the user may have regarding what 
techniques are associated with which category, which 
platform can perform which techniques, or what data driven 


technology is appropriate for a given case study. 


F. SUMMARY 








In this chapter we reviewed the initial decision 
technology and data mining schemas, the database tables, and 
how SQL queries created from those database tables would 


look. These include the adaptation of case studies into the 





schema. The schemas allow for further development into other 
decision technologies beyond the data driven/data mining 


technology. 











In the next chapter (Chapter IV - User Interface) we 





will utilize the schema and database tables described as the 
foundation for building a conceptual user interface for the 


IDTE. 
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IV. USER INTERFACE 


A. INTRODUCTION 





In this chapter, 


the IDTE. 








for A hypothetical 











typical 
guide 





Clementine or Megaputer 


use 


IDTE user through the screens of the 
the user to one of two existing programs, 


PolyAnalyst, 





decision tree analysis on a dataset. 








shown of the conceptual 


IDTE User 





Case 


will 





IDTE 





for 


we design a conceptual user interface 
motivate a 


that will 





eit 


Eher SPSS 


performing a 








this process. 


A user in general may want 








tasks or look-ups from the 


of the IDTE. 











specific technique, 


to a software environment which su 


user may also want to utilize the 


Case Studies and Applications for 


a specific problem. 


associated links to tutorials to familiariz 


existing software platforms. The 


resource or teaching aid in the 





field. Users can also browse to 





Data-Driven / 


decision tree analysis, 








The user can access the 





Data-! 





to perform several different 





ppeorts that technique. 


guidance in how to set 








Data Mining segment 


Screen captures will be 


Interface that supports 





The Use Case illustrates someone looking for a 


and navigating 


A 


IDTE to scan existing Use 


Up 


IDTE and its 





themselves with 








Driven / 





IDTE can thus be used as a 


Data Mining 


identify what techniques 


and/or software platforms exist in the overall data mining 


universe. 


B. IDTE FLOWCHART 


The use case for this chapter is a generic description 


of the process needed to analyze a data set using a decision 
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tree. Once a platform has been identified and selected, the 





IDTE will transfer the user to the desired program 





environment. 


The first step in the development of the user interface 





is to create a navigation flowchart that shows how the user 





will progress through the different screens of the IDTE. The 








flowchart in Figure 18 follows only the path that a user 





would follow to select a software package for applying a 
decision tree technique for analysis. The other paths are 


not shown to their completion but would have similar options 








available based on associated attributes. 
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—_| 
— Data Visualization 
Page 
Data Warehouse 
Data DT Page Page 

OLAP Page 

— = =a 







Data Mining Page 








Select 
















































































































Category 
Tutorials Page 
Select Category a 9 
es Neural Network 
Prediction Page | Tigdiateue Reem 
| | Case Studies 
— sg Page 
Clustering Page 
— 
—_— oxi Decision Tree Select 
ce Technique Page Destination 
Link 
Classification —_— — References Page 
|__+p Page 
CART Technique 
>] Page 
oa Google Search 
A iation Page Platform Page P 
ssociation Pag —— 9) —E age 
| 
_— —N | —_ 
] 
CHAID Technique | 
>| Page 
—_ " Select Platform 
| Additional 5 | 
| Techniques can be Renee | 
listed as —>| i 
| Appropriate Technique Page | 
A _ — Additional 
Beene ey be Clementine Metaputer 
_ 
— _— 
Figure 18. IDTE Flowchart. 
The flowchart does not show all possible 


interconnections in that each page can in principle be 


connected to every other page. The user will be able to link 
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to any page based on a selection from a Windows Explorer 





type tree available on each page viewed. This is explained 


further in the next section. 


Cc. IDTE USER INTERFACE 








The conceptual IDTE was created with Microsoft Visual 


Studio 2005 using Visual Basic. The choice of Visual Studio 





was based on familiarity with Microsoft programs and user 
interfaces. Consistency and compatibility with the installed 
Windows platforms prevalent at Naval Postgraduate School 


computer laboratories was an additional factor. 


Fach screen provides simple buttons that are labeled to 


indicate what will happen when they are selected. Each 








screen will have the same look and feel to the user, 





including the familiar minimize, maximize, and close buttons 





in the right top hand corner. 


Figure 19 is the welcome screen that allows the user to 














become accustomed to the format that the IDTE will follow. 








The screens will be designed with four distinct areas of 





functionality. The first is the title area, centered at the 





top of the screen, which identifies the current screen. 











Below and to the right of the title area is a window with a 





basic definition of the screen. To the left of the 


definition area is the Windows Explorer navigation area. 





This area allows the user to expand nodes in a familiar tree 
fashion displaying the different screens available for 
selection. By clicking on the “+” sign the tree will be 


N“ NN 


expanded to the next lower level and by clicking on the “- 








sign, the tree will collapse the level up to the next higher 





category. Users always have the option to click on any title 
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in this section to proceed directly to the desired page 





after they have become more familiar with the type of 
decision technology they require, as shown below. The final 
area (the button area) is located in the lower right hand 
side of each screen. Each screen other than the welcome 


screen will have the five buttons shown in Figure 2 and a 





“Back” button that will return to the previous screen. The 





buttons from top to bottom are: Platforms, References, Case 





Studies, Tutorials/Demos, and Google Search. 


The “Platforms” button functionality depends upon which 





page is current. When selected, it will navigate the user to 





a screen that lists the platforms available to meet the 


requirements for that particular screen. 





The “References” button opens a separate window with a 





list of references that the user can access to gain further 





insight into Decision Technologies. This selection is also 








context sensitive, showing only relevant references for the 


focus of the current screen. 


The “Case Studies” button opens a separate window with 
a list of applicable case studies based on the current 
screen. Case studies can be added to a database and 
categorized based on technology, category, technique, and 


platform. 





The “Tutorial/Demos” button opens the tutorial supplied 


by the platform software relevant to the current screen. 





These will also be opened up in separate windows so the user 




















can toggle between the IDTE and the tutorial. 
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The final button is a “Google Search” button. This 
button opens a separate window to allow the user to Google 


search as required for ad hoc information. 


Welcome to the IDTE 


The Integrated Decision Technology Environment - 


(4). Decision Technology The IDTE provides access to a wide range of 
decision technologies for data analysis, modeling, 
and knowledge dynamics at NPS. The IDTE 
provides advanced capabilities for building decision 
support systems from one or more existing software 











Figure 19. IDTE Welcome Screen. 





Once users enter the environment they will be greeted 





by a Decision Technologies screen (Figure 20). The Decision 





Technologies are listed on the left of the screen. If the 
user does not know which technology is appropriate for his 


purpose, he can utilize a “mouse over” feature to hover the 











mouse over the words in the Windows Explorer tree area to 
bring up a pop-up window with the definition of that 
Technology from the Title section of that page. This will 


allow the user to make a decision about what selection to 
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navigate to without a trial and error approach on which 
technology to follow. We assume for this use case that we 


want to select the Data Driven Technology. 





=) Decision Technology Decision Technologies are interactive computer- 
)- Data-Driven based systems or subsystems intended to help 
i re decision makers use communications technologies, 
Document Driven data, documents, knowledge and/or models to 
(4) Model-Driven identity and solve problems, complete decision 


- Knowledge-Driven process tasks, and make decisions. 
+) Collaboration-Driven 





Figure 20. Decision Technology Screen. 


Figure 21 shows the Data-Driven Technology Screen, 





which contains the four subject areas of Data Warehousing, 
OLAP, Data Mining and Data Visualization. At this point, we 


can see how the interface aligns with the Windows Explorer 





paradigm and so will proceed to the Data Mining area which 


is the focus of this work. 
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Data-Driven Technology 


&)- Decision Technology Data-Driven Decision Technologies provide 
&)- Data-Driven neers with 7 pears to store me sort 
i through vast amounts of data by leveraging data 
: Data Warehouse warehousing and data-mining. These data-orented 
(4) OLAP decision technologies can assist decision-makers in 
()- Data Mining making better and more informed decisions in 
(4) Data Visualization shorter durations of time. 


+). Document-Driven 
#}- Model-Driven 

#)- Knowledge-Driven 
#)- Collaboration-Driven 





Figure 21. Data-Driven Technology Screen. 


The use case we are considering is one where an analyst 





is searching for a platform which can create decision trees 
from datasets. Exploding the Data Mining option yields the 


screen in Figure 22. 
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Data Mining Techniques 


©) Decision Technology Data Mining (also called knowledge data discovery, 
=} Data-Driven KDD) is best defined as the use of automated data 
ss analysis techniques to find previously unknown 
i Data Warehouse connections between data stored in a data 
- OLAP warehouse. The basic techniques are the common 
G- Data Mining categories that ere currently in use: classification, 
: G+ Association association, prediction, and clustering. 
i Classification 
)- Clustering 


(#)- Prediction 
: G+ Data Visualization 
(8) Document-Driven 
- Model-Driven 
(4) Knowledge-Driven 
(#)- Collaboration-Driven 





Figure 22, Data Mining Techniques Screen. 





Using the mouse-over feature, the user can determine 


which data mining category is most likely to contain the 





Decision Tree technique. In this case, the Prediction 





Technique is the appropriate category, and the associated 


screen is shown in Figure 23. 
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=) Decision Technology Prediction follows the association analysis 
=) Data-Driven pattem to determine relationships between 
i. independent events. Prediction differs in the fact 
Data Warehouse that it utilizes the completion of an event and 
- OLAP past history to predict the next event that is most 
&)- Data Mining likely to occur. It forms a predictive chain of 
Gf). Associstion events that can be expected to take place 
a ——e based on past data that has been collected and 
en analyzed. Prediction follows a timeline of events 
G#)- Clustering that only advances when the previous event 
)- Prediction 
© Prediction Techniques 


-~ Decision Tree 
~~ Multiple Regression 
‘~ Neural Network 
(4) Data Visualization 
~ Document-Driven 
~ Model-Driven 
- Knowledge-Driven 


.. Callaharatinn-Nriven 


db 





Figure 23. Prediction Techniques Screen. 


The user then will navigate to the Decision Tree screen 





(Figure 24). This screen, as the others, provides the user 
with a brief definition of Decision Trees, the Windows 
Explorer navigation tree, and buttons to perform select 


tasks. 
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Decision Tree 


) Decision Technology Decision trees are sequences of events that are 

&)- Data-Driven are from the aa by using logic — and 
i db. rules to partition the data sets into increasing 

i Data Warehouse smaller categories. This drilling down of the data 

= OLAP sets continues until a user defined stopping point is 
=]: Data Mining reached, or the partitioning occurs at the simplest 
Gi). Association level. The completed tree will form a complete path 
of events based on the data. 








(4) Clustering 
©) Prediction 


Multiple Regression 
: Neural Network 
)- Data Visualization 
- Document-Driven 
» Model-Driven 
» Knowledge-Driven 
.. Callaharatinn-Driven 








& 
iB 
Es 
rs 





Figure 24. Decision Tree Screen. 


From this screen, selecting the Platforms button shows 
the platforms available at NPS to perform Decision Tree 
analysis, namely SPSS Clementine and Megaputer PolyAnalyst. 
The SQL query linked to the Platforms button would be: 


SELECT [Decision Technology] .DT_Platform_Software_System, 
DM_Technique.Technique_Name 








FROM DM_ Technique INNER JOIN (([Data Mining_DT] NNER JOIN 
[Data Mining _DT_Supports] ON [Data Mining_DT].ID_ = [Data 




















Mining_DT_Supports]. [Data Mining_DT_ID_FK]) NNER JOIN 
([Decision Technology] INNER JOIN (DataDriven_DT INNER JOIN 
DataDriven_DT_DD_Category ON DataDriven_DT.ID_ = 
DataDriven_DT_DD_Category.DataDriven_DT_ID_FK) ON [Decision 
Technology].ID_ = DataDriven_DT. [Data_DT_Id_Decision 
Technology_ID_FK]) ON [Data Mining_DT].ID_ = 


DataDriven_DT_DD_Category. [Data Mining_DT_ID_FK]) ON 
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DM_Technique.ID_ = [Data 


Mining_DT_Supports].DM_Technique_ID_FK 











WHERE (( (DM _Technique.Technique_Name)="decision tree")); 


Figure 25 shows the screen as it would appear to the 


user once the Platforms button was selected. 


| a! Decision Tree folie ss) | 


Decision Tree Platforms 





~- Decision Technology Metaputer PolyAnalyst is robust software package 
©) Data-Driven that utilizes several data mining techniques for 





knowledge discovery. The techniques that 
a Data Warehouse PolyAnalyst can perform are: Classification, 
a) OLAP Clustering, Association, and Prediction which 
&)- Data Mining includes Pattem Leaming and Trend Analysis. 
: (4) Association 
F)-- Classification SPSS Clementine provides 6 graphical user 
:)- Clustering interface that shows the steps in the data mining [Jj 
: = process as they are being used. Clementine 
» Prediction offers the user the ability to utilize the techniques 

)- Prediction Techniques of data mining. It is designed to allow the userto _ 

imhanee hetween Chicterinn Claeeificratinn 














Multiple Regression 
Neural Network 
(4) Data Visualization 
(4 Document-Driven 
(4) Model-Driven 
[4 Knowledge-Driven 


1.. Callaharatinn-Driven 








Figure 25. Decision Tree Platforms Screen. 


Figure 26 and Figure 27 show the PolyAnalyst and 
Clementine start-up screens respectfully. From there the 


user can begin the decision tree and data entry. 
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Load new data Open existing project 


Open last project View Tutorial 


Show this screen at startup 
Create a new project from a data 
source or open an existing project. 
Ifyoutte new to PolyAnalyst, view 
the tutorials to learn how to use the 
system. 


Back Next Finish Cancel 








Figure 26. Megaputer PolyAnalyst Start-up Screen. 


Stream1* - Clementine 11.1 ~ |follx 


File Edit Insert View Tools SuperNode Window Help 


Clementine 


5 | Outputs | Models 








O Server: Local Server | 80MB/ 134M 
-! 2] 


Figure 27. Clementine Start-up Screen. 





Also from the screen in Figure 24, selecting the Case 
Studies button shows the Case Studies available at NPS that 
53 





have utilized Decision Tree analysis. The SQL query that 


would be linked to the Case Studies button would be: 





SELECT [Case Study] .Description AS [Case Study_Description], 
DM_Technique.Description AS DM_Technique_Description 











































































































FROM DM_ Technique INNER JOIN (([Case Study] INNER JOIN [Case 
Study_Model] ON [Case Study].ID_ = [Case Study_Model].[Case 
Study_ID_FK]) INNER JOIN [Case Study_Model_Process] ON [Case 
Study_Model].ID_ = [Case Study_Model_Process]. [Case 
Study_Model_ID_FK]) ON DM_Technique.ID_ = [Case 
Study_Model_Process] .DM_Technique_ID_FK; 
WHERE (((DM_Technique.Technique_Name) ="decision tree")); 

The remaining buttons on the screen (Reference, 


Tutorials/Demo, and Google Search) would be attached to a 





hyperlink that would open automatically open an Internet 
Explorer window and navigate to the appropriate web page. 


Examples of the hyperlinks are: 


m “Reference”: one might see links such as 


http://infogoal.com/dmc/dmcdwh.htm 








= “Tutorials/Demo”: a link would point to a location on the 
server that held the software platform. For example, the 
Megaputer PolyAnalyst Tutorial would be hyperlinked to 


the following file location: 








C:/Program%20Files/Megaputer%20Intelligence/PolyAnalyst%2 
5.0/Help/toc.html 





=m Google: this would redirect the user to 


http://www.google.com, or alternatively one might embed 











the Google search software within the IDTE environment so 
the search engine could be invoked at any stage of the 


process. 
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D. FUTURE ENHANCEMENTS 








The IDTE can be enhanced with additional features as 





requirements are identified. One particularly desirable 


feature would provide a local, search-based capability. This 





enhancement would place a search bar on the navigation pages 





that allows the user to specify search criteria for 





identifying screens from the Decision Technology Explorer 
outline. This would provide an alternative to the 


hierarchical navigation which we’ve designed. So if a user 











types “decision trees” into the search engine, the IDTE 








would navigate directly to the Decision Tree Page shown in 
Figure 24 without having to navigate through the screens in- 


between. 


E. SUMMARY 








In this chapter we developed a conceptual flowchart for 








a preliminary version of the IDTE user interface, focusing 
upon the data-driven and data mining decision technologies. 
The interface was designed from the decision technology 


taxonomy and related database schema developed in the 





previous two chapters respectively. The flowchart yields a 








simple, yet consistent interface which provides users 





several different perspectives for learning about and using 





available decision technologies at NPS. 
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Vv. SUMMARY 





Decision support technologies have remained 


individualistic in nature. The ability to access. and 








integrate a wide range of such technologies in an Integrated 


Decision Technology Environment can potentially increase a 





user’s ability to create more complex decision support 








projects. The IDTE will allow the user to learn, access and 





use available decision technologies quickly and easily. 








Data-driven DSS provide decision-makers with the 
capability to store and sort through potentially vast 


amounts of data by leveraging data warehousing and data- 





mining. These data-oriented decision technologies can assist 
decision-makers in making better and more informed decisions 


in shorter durations of time. 








This research has focused upon data-driven decision 





technologies and how they can be integrated into an overall 

















DER n the process of identifying data-driven technology 





requirements, we first specified a simple taxonomy, based 





upon their properties and roles. Four categories wer 


identified: association, Classification, clustering, and 








prediction. Next we developed a database schema for storing 


the relevant data about these technologies including 





platform data as well as case studies. Finally we designed a 





simple, yet effective, interface for navigating through the 


data-driven decision technology universe both at NPS and 





beyond. SQL commands for populating the various screens of 











the IDTE interface were provided to show proof of concept. 








The thesis clearly showed how an IDTE could be useful 





to students and researchers at NPS in their ability to 
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define more complex working models of DSS. The conceptual 





model for IDTE presents an exciting opportunity for decision 
technology development, while advancing the opportunity for 
integration of different decision technologies into a 


cohesive environment. As additional decision technologies 











are incorporated into the IDTEH, it offers the ability to 





address ever more complex decision-making challenges. 
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